Skip to content

fix(daemon): sweep full workspace triple on startup and crash exit#239

Merged
chrisleekr merged 1 commit into
mainfrom
fix/issue-221
Jun 20, 2026
Merged

fix(daemon): sweep full workspace triple on startup and crash exit#239
chrisleekr merged 1 commit into
mainfrom
fix/issue-221

Conversation

@chrisleekr

Copy link
Copy Markdown
Owner

Summary

Fixes a filesystem + token-leak bug (#221) where three per-job paths under CLONE_BASE_DIR were orphaned after a pod SIGKILL, OOM kill, or spot eviction: the clone directory <workDir>, the installation-token-bearing credential helper <workDir>.cred.sh, and the sibling summary directory <workDir>-artifacts. Two independent gaps combined: registerExitCleanup and handleJobCancel in job-executor.ts removed only the clone dir + credential helper (the -artifacts dir was always leaked, even on graceful exit), and the daemon had no startup sweep at all — the only reaper lived in app.ts, ran in the webhook/orchestrator process (wrong filesystem in production), and was scoped to *.cred.sh files only.

Flow

flowchart TD
    Kill["SIGKILL or OOM or eviction<br/>also graceful exit and cancel"]:::trig
    Old["Before<br/>exit and cancel removed clone dir and cred.sh only<br/>app.ts reaper removed cred.sh only"]:::old
    Leak["Orphaned<br/>-artifacts dir always<br/>clone dir on daemon SIGKILL"]:::old
    New["After<br/>removeWorkspaceTripleSync removes all three<br/>on exit and cancel"]:::new
    Sweep["sweepStaleWorkspaces at startup<br/>reaps entries older than WORKSPACE_STALE_TTL_MS<br/>emits a workspace.sweep log line"]:::new
    Clean["No orphans accumulate"]:::keep
    Kill --> Old --> Leak
    Kill --> New --> Clean
    Sweep --> Clean
    classDef trig fill:#1f618d,color:#ffffff
    classDef old fill:#7b2d2d,color:#ffffff
    classDef new fill:#2c5f2e,color:#ffffff
    classDef keep fill:#2c3e50,color:#ffffff
Loading

Changes

  • New src/core/workspace-sweep.ts (parameter-based, config-free):
    • removeWorkspaceTripleSync(workDir) — synchronous best-effort removal of the full triple, for process.on("exit") / cancel handlers; self-guards the empty-workDir scoped-job invariant so no removal can target a CWD-relative .cred.sh/-artifacts.
    • sweepStaleWorkspaces(cloneBaseDir, ttlMs, log) — async TTL reaper; tolerant of a missing base dir and concurrent removal; emits one structured workspace.sweep log line with {swept, retained, durationMs}.
  • src/daemon/job-executor.ts — both registerExitCleanup and handleJobCancel now call removeWorkspaceTripleSync (the -artifacts sibling is now reaped on crash exit and cancel). Scoped-job guards (workDir === "") preserved.
  • src/daemon/main.tssweepStaleWorkspaces runs once at boot, before discoverCapabilities and the WebSocket connect, so no in-flight job of this lifetime can be reaped.
  • src/app.ts — the inline *.cred.sh-only reaper is replaced by the shared helper (now also sweeps clone dirs + artifacts); dead STALE_CRED_TTL_MS constant and now-unused imports removed.
  • src/config.ts — new workspaceStaleTtlMs from WORKSPACE_STALE_TTL_MS (default 3600000 = 1h, positive-int validated).
  • Docsconfiguration.md (env row), observability.md (workspace.sweep event), runbooks/daemon-fleet.md (startup sweep).
  • Teststest/core/workspace-sweep.test.ts: triple removal, absent-path no-op, empty-workDir no-op, aged-swept/fresh-retained (exact counts + workspace.sweep log assertion), partial-orphan stale-cred.sh independence, missing-base tolerance.

Why it's safe

cloneBaseDir is pod-local ephemeral storage (no shared volume in the ephemeral-daemon Pod spec) holding only ephemeral per-job workspaces — every job mkdtemps fresh; there is no persistent repo cache. The sweep runs at startup before any job is accepted, so the mtime-based per-entry reap cannot touch a live workspace.

Related issues

Closes #221

Test plan

  • test/core/workspace-sweep.test.ts — 6 cases, all pass in isolation
  • typecheck / lint / format / docs:build (mkdocs strict + doc-sync + citations) clean
  • Full bun test: stash-verified baseline shows this change adds passes with fail/error counts unchanged; remaining failures are pre-existing local DB/Valkey-absent suites. The isolated CI runner (scripts/test-isolated.sh) is the authoritative signal.

The daemon orphaned three per-job paths under CLONE_BASE_DIR on SIGKILL
(OOM/eviction/kubelet kill): the clone dir, the token-bearing
${workDir}.cred.sh, and ${workDir}-artifacts. registerExitCleanup and
handleJobCancel removed only the first two even on graceful exit, and the
daemon had no startup sweep (the only reaper lived in app.ts, ran in the
wrong process, and matched *.cred.sh only).

Add src/core/workspace-sweep.ts: removeWorkspaceTripleSync (sync, for
exit/cancel handlers) and sweepStaleWorkspaces (startup TTL reaper, emits
a workspace.sweep log). Wire into both job-executor cleanup sites, daemon
main() startup, and app.ts (replacing its divergent cred-only sweep). New
WORKSPACE_STALE_TTL_MS env (default 1h).

Closes #221

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015v2bspPF1ZTTG9ZZrbBgA1
Copilot AI review requested due to automatic review settings June 20, 2026 12:25
@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@chrisleekr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 35 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bebe994-9025-4557-968d-3e14a872bf06

📥 Commits

Reviewing files that changed from the base of the PR and between 675d610 and 5230305.

📒 Files selected for processing (9)
  • docs/operate/configuration.md
  • docs/operate/observability.md
  • docs/operate/runbooks/daemon-fleet.md
  • src/app.ts
  • src/config.ts
  • src/core/workspace-sweep.ts
  • src/daemon/job-executor.ts
  • src/daemon/main.ts
  • test/core/workspace-sweep.test.ts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chrisleekr chrisleekr merged commit 56fa714 into main Jun 20, 2026
16 checks passed
@chrisleekr chrisleekr deleted the fix/issue-221 branch June 20, 2026 12:29

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes daemon workspace leaks by ensuring all three per-job paths under CLONE_BASE_DIR (clone dir, ${workDir}.cred.sh, and ${workDir}-artifacts) are cleaned up on cancel/exit and are swept on process startup via a TTL-based reaper.

Changes:

  • Added src/core/workspace-sweep.ts with removeWorkspaceTripleSync() (exit/cancel) and sweepStaleWorkspaces() (startup TTL sweep + structured log).
  • Updated daemon and webhook startup to run the shared startup sweep, and updated daemon cancel/exit handlers to remove the full workspace triple.
  • Added unit tests and updated docs/config for the new WORKSPACE_STALE_TTL_MS knob and the workspace.sweep log event.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/core/workspace-sweep.ts Introduces shared helpers to remove full workspace triples and sweep stale entries with TTL + structured logging.
src/daemon/job-executor.ts Replaces ad-hoc rmSync cleanup with removeWorkspaceTripleSync for exit and cancel paths.
src/daemon/main.ts Runs sweepStaleWorkspaces at daemon boot before capability discovery / WS connect.
src/app.ts Replaces the previous *.cred.sh-only reaper with the shared sweepStaleWorkspaces.
src/config.ts Adds validated workspaceStaleTtlMs sourced from WORKSPACE_STALE_TTL_MS.
test/core/workspace-sweep.test.ts Adds unit coverage for triple removal, TTL sweep behavior, and logging contract.
docs/operate/configuration.md Documents WORKSPACE_STALE_TTL_MS.
docs/operate/observability.md Documents the workspace.sweep log event and fields.
docs/operate/runbooks/daemon-fleet.md Notes daemon startup sweep behavior and operational expectations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +4 to +6
* RED phase: the module under test does not exist yet, so the import below
* fails at resolution time and every case errors. Production code is written
* in the GREEN phase.
Comment on lines +81 to +84
} catch {
// Base dir unreadable / uncreatable: nothing to sweep.
return { swept: 0, retained: 0, durationMs: Date.now() - startedAt };
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(pipeline): daemon leaks workDir + artifacts dirs on SIGKILL; no startup sweep, exit handler skips artifacts

2 participants